Japanese/English Cross-Language Information Retrieval: Exploration of Query Translation and Transliteration

Authors

  • Atsushi Fujii
  • Tetsuya Ishikawa
Abstract

Cross-language information retrieval (CLIR), where queries and documents are in different languages, has recently become one of the major topics within the information retrieval community. This paper proposes a Japanese/English CLIR system in which we combine query translation and retrieval modules. We currently target the retrieval of technical documents, and therefore the performance of our system is highly dependent on the quality of the translation of technical terms. However, technical term translation remains problematic because technical terms are often compound words, and new terms are progressively created by combining existing base words. In addition, Japanese often represents loanwords using its special phonograms. Consequently, existing dictionaries find it difficult to achieve sufficient coverage. To counter the first problem, we produce a Japanese/English dictionary for base words, and translate compound words on a word-by-word basis. We also use a probabilistic method to resolve translation ambiguity. For the second problem, we use a transliteration method, which maps words unlisted in the base word dictionary to their phonetic equivalents in the target language. We evaluate our system using a test collection for CLIR, and show that both the compound word translation and transliteration methods improve the system performance.
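The word-by-word compound translation with probabilistic disambiguation described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy dictionary `BASE_DICT`, the bigram table `BIGRAM_PROB`, the `FLOOR` smoothing value, and the function `translate_compound` are hypothetical, and the bigram scoring is a simple stand-in, not the probabilistic model the authors actually use. Words missing from the dictionary are handed to a caller-supplied `fallback`, which is where a transliteration step would plug in.

```python
from itertools import product

# Hypothetical toy base-word dictionary (romanized Japanese -> English candidates).
BASE_DICT = {
    "jouhou": ["information", "intelligence"],
    "kensaku": ["retrieval", "search"],
}

# Hypothetical target-language bigram probabilities (illustrative values only).
BIGRAM_PROB = {
    ("information", "retrieval"): 0.6,
    ("information", "search"): 0.2,
    ("intelligence", "search"): 0.05,
}
FLOOR = 1e-3  # assumed score for word pairs not listed in BIGRAM_PROB


def translate_compound(base_words, fallback=lambda w: [w]):
    """Pick the highest-scoring English rendering of a compound word.

    base_words: the compound already split into base words.
    fallback:   called for words not in BASE_DICT; must return a list of
                candidates (a transliteration method could be supplied here).
    """
    # Collect candidate translations for each base word.
    candidates = [BASE_DICT.get(w) or fallback(w) for w in base_words]

    best, best_score = (), -1.0
    # Enumerate every combination of candidates and score adjacent pairs.
    for combo in product(*candidates):
        score = 1.0
        for left, right in zip(combo, combo[1:]):
            score *= BIGRAM_PROB.get((left, right), FLOOR)
        if score > best_score:
            best, best_score = combo, score
    return " ".join(best), best_score


if __name__ == "__main__":
    print(translate_compound(["jouhou", "kensaku"]))
```

Exhaustive enumeration is only practical for short compounds; a longer compound would call for a dynamic-programming search over the same scores.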


Similar resources

Cross-Language IR at University of Tsukuba: Automatic Transliteration for Japanese, English, and Korean

This paper describes our cross-language information retrieval system for the NTCIR-4 CLIR task. Our system, which follows the query translation approach, uses a compound word translation and transliteration. Transliteration is effective if a query includes foreign words, such as technical terms and proper nouns, spelled out by phonetic alphabets. We apply our method, which was originally propos...


NTCIR-3 Cross-Language IR Experiments at ULIS

This paper describes our retrieval system for the NTCIR-3 CLIR task, focusing on Japanese and English. We integrate query and document translation methods to improve retrieval accuracy, and perform clustering to improve browsing efficiency. In query translation, to derive possible translations for user queries, we use dictionaries and perform a transliteration method, which generates translatio...


Improving Tamil-English Cross-Language Information Retrieval by Transliteration Generation and Mining

While state of the art Cross-Language Information Retrieval (CLIR) systems are reasonably accurate and largely robust, they typically make mistakes in handling proper or common nouns. Such terms suffer from compounding of errors during the query translation phase, and during the document retrieval phase. In this paper, we propose two techniques, specifically, transliteration generation and mini...


KUNLP System for NTCIR-3 English-Korean Cross-Language Information Retrieval

This paper describes KUNLP system for the English-Korean cross-language information retrieval track in NTCIR-3 workshop and some experiments after the workshop. Query translation method based on the bilingual dictionary and the document language corpus was used. To automatically transliterate some proper nouns such as Korean person names, Korean place names, and Korean company names, we have co...


Enhancing English/Arabic CLIR Using Word Collocations and Statistical Translation and Transliteration Resources

In Cross Language Information Retrieval (CLIR), queries in one language retrieve documents in other language(s). This can be done through Query Translation that comes up against Translation/Transliteration challenges like ambiguity as the main problems. In this paper, a comprehensive solution has been introduced for these challenges. 1, 4 powerful English/Arabic Machine Readable Dictionaries (M...



Journal:
  • Computers and the Humanities

Volume: 35


Publication date: 2001